Introduction


Data analysis with multivariate linear regression on voting data for the 2006 and 2010 elections for the Brazilian “Câmara Federal de Deputados”. The data was taken from the TSE portal.




Data Overview


Loading Data

eleicoes_data <- readr::read_csv(
  here::here('data/eleicoes_2006_e_2010.csv'), 
  locale = readr::locale("br"),
  col_types = cols(
    ano = col_integer(),
    sequencial_candidato = col_character(),
    quantidade_doacoes = col_integer(),
    quantidade_doadores = col_integer(),
    total_receita = col_double(),
    media_receita = col_double(),
    recursos_de_outros_candidatos.comites = col_double(),
    recursos_de_pessoas_fisicas = col_double(),
    recursos_de_pessoas_juridicas = col_double(),
    recursos_proprios = col_double(),
    `recursos_de_partido_politico` = col_double(),
    quantidade_despesas = col_integer(),
    quantidade_fornecedores = col_integer(),
    total_despesa = col_double(),
    media_despesa = col_double(),
    votos = col_integer(),
    .default = col_character()))
# Let's put everything in Upper case for uniformity
eleicoes_data %>% 
  mutate(nome = toupper(nome),
         sexo = toupper(sexo),
         grau = toupper(grau),
         cargo = toupper(cargo),
         ocupacao = toupper(ocupacao),
         partido = toupper(partido),
         estado_civil = toupper(estado_civil),
         sequencial_candidato = as.numeric(sequencial_candidato)) -> eleicoes_data

# Adding surrogate key to dataframe
eleicoes_data$id <- 1:nrow(eleicoes_data)

eleicoes_data %>% 
  glimpse()
## Observations: 7,476
## Variables: 25
## $ ano                                   <int> 2006, 2006, 2006, 2006, ...
## $ sequencial_candidato                  <dbl> 10001, 10002, 10002, 100...
## $ nome                                  <chr> "JOSÉ LUIZ NOGUEIRA DE S...
## $ uf                                    <chr> "AP", "RO", "AP", "MS", ...
## $ partido                               <chr> "PT", "PT", "PT", "PRONA...
## $ quantidade_doacoes                    <int> 6, 13, 17, 6, 48, 6, 14,...
## $ quantidade_doadores                   <int> 6, 13, 16, 6, 48, 6, 7, ...
## $ total_receita                         <dbl> 16600.00, 22826.00, 1581...
## $ media_receita                         <dbl> 2766.67, 1755.85, 9301.2...
## $ recursos_de_outros_candidatos.comites <dbl> 0.00, 6625.00, 2250.00, ...
## $ recursos_de_pessoas_fisicas           <dbl> 9000.00, 15000.00, 34150...
## $ recursos_de_pessoas_juridicas         <dbl> 6300.00, 1000.00, 62220....
## $ recursos_proprios                     <dbl> 1300.00, 201.00, 59500.0...
## $ recursos_de_partido_politico          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ quantidade_despesas                   <int> 14, 24, 123, 8, 133, 9, ...
## $ quantidade_fornecedores               <int> 14, 23, 108, 8, 120, 9, ...
## $ total_despesa                         <dbl> 16583.60, 20325.99, 1460...
## $ media_despesa                         <dbl> 1184.54, 846.92, 1187.09...
## $ cargo                                 <chr> "DEPUTADO FEDERAL", "DEP...
## $ sexo                                  <chr> "MASCULINO", "FEMININO",...
## $ grau                                  <chr> "ENSINO MÉDIO COMPLETO",...
## $ estado_civil                          <chr> "CASADO(A)", "SOLTEIRO(A...
## $ ocupacao                              <chr> "VEREADOR", "SERVIDOR PÚ...
## $ votos                                 <int> 8579, 2757, 17428, 1193,...
## $ id                                    <int> 1, 2, 3, 4, 5, 6, 7, 8, ...
  • Let’s take a look at the quantitative predictors of interest.
  • sequencial_candidato has no meaning attached to it, so we’ll skip it.
eleicoes_data %>%
  select(id,
         quantidade_despesas,
         quantidade_fornecedores,
         recursos_de_partido_politico,
         recursos_de_pessoas_juridicas,
         recursos_de_pessoas_fisicas,
         recursos_de_outros_candidatos.comites) %>%
  melt(id=c("id"))  %>%
  ggplot(aes(x = value)) + 
  geom_histogram(bins = 30) + 
  facet_wrap(. ~ variable,
             ncol = 2,
             scales = "free_x")  +
  labs(x="Predictor",y="Absolute Frequency")

  • We can see a positive skew across the checked predictors.
eleicoes_data %>%
  select(id,
         total_receita,
         media_receita,
         total_despesa,
         media_despesa,
         recursos_proprios,
         quantidade_doacoes,
         quantidade_doadores) %>%
  melt(id=c("id"))  %>%
  ggplot(aes(x = value)) + 
  geom_histogram(bins = 30) + 
  facet_wrap(. ~ variable,
             scales = "free_x") 

  • The overall positive skew extends to these variables as well.


Dealing with positive skew


The standard method to deal with a positive skew is to apply a logarithmic transformation to the affected predictor. However, this transformation requires that the predictor contain no zeros, since \(\log(0)\) is undefined.
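As a quick, self-contained illustration (on a made-up vector), log() breaks down on zeros, while log1p(), i.e. \(\log(1+x)\), is a common workaround for zero-containing predictors:

```r
# Minimal sketch (base R): log() fails on zeros, log1p() handles them.
x <- c(0, 1, 10, 100, 1000)   # positively skewed, contains a zero
log_x   <- log(x)             # first element becomes -Inf
log1p_x <- log1p(x)           # log(1 + x): finite for every element
any(is.infinite(log_x))       # TRUE
any(is.infinite(log1p_x))     # FALSE
```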

eleicoes_data %>%
  select(quantidade_doacoes,
         quantidade_doadores,
         total_receita,
         media_receita,
         recursos_de_outros_candidatos.comites,
         recursos_de_pessoas_fisicas,
         recursos_de_pessoas_juridicas,
         recursos_proprios,
         recursos_de_partido_politico,
         quantidade_despesas,
         quantidade_fornecedores,
         total_despesa,
         media_despesa) %>%
  sapply(., function(x) 0 %in% x) %>%
  as.data.frame(row.names = NULL) %>%
  tibble::rownames_to_column() %>%
  set_colnames(c("predictor","contains_zero")) %>%
  arrange(contains_zero)
  • There are four predictors on which we may apply the logarithmic transformation
# apply logarithmic transformation
eleicoes_data %>%
  mutate(log.quantidade_doacoes = log(quantidade_doacoes),
         log.quantidade_doadores = log(quantidade_doadores),
         log.quantidade_despesas = log(quantidade_despesas),
         log.quantidade_fornecedores = log(quantidade_fornecedores)) -> eleicoes_data


# put all quantitative predictors (of interest) in same scale
eleicoes_data %>%
  mutate_at(.vars = vars(quantidade_doacoes,
                         quantidade_doadores,
                         total_receita,
                         media_receita,
                         log.quantidade_doacoes,
                         log.quantidade_doadores,
                         log.quantidade_despesas,
                         log.quantidade_fornecedores,
                         sequencial_candidato,
                         recursos_de_outros_candidatos.comites,
                         recursos_de_pessoas_fisicas,
                         recursos_de_pessoas_juridicas,
                         recursos_proprios,
                         recursos_de_partido_politico,
                         quantidade_despesas,
                         quantidade_fornecedores,
                         total_despesa,
                         media_despesa),
             .funs = funs(as.numeric(scale(.)))) -> scaled_data
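For reference, scale() z-standardizes a vector: it subtracts the mean and divides by the standard deviation, so every rescaled predictor ends up with mean 0 and standard deviation 1. A minimal sketch on a made-up vector:

```r
# Minimal sketch: scale() centers to mean 0 and rescales to sd 1 (z-scores).
x <- c(2, 4, 6, 8, 10)
z <- as.numeric(scale(x))   # (x - mean(x)) / sd(x)
round(mean(z), 10)          # 0
sd(z)                       # 1
```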


Data Exploration

eleicoes_data %>%
  filter(ano == 2006) %>%
  group_by(partido) %>%
  summarize(n = sum(votos)) %>%
  ggplot(aes(reorder(partido,n), n)) +
  geom_bar(stat = "identity") + 
  theme(axis.text.x = element_text(angle = 90,
                                   hjust = 1)) +
  labs(x="Political Party",
       title="2006 elections",
       y="Number of votes") -> p1

eleicoes_data %>%
  filter(ano == 2010) %>%
  group_by(partido) %>%
  summarize(n = sum(votos)) %>%
  ggplot(aes(reorder(partido,n), n)) +
  geom_bar(stat = "identity") + 
  theme(axis.text.x = element_text(angle = 90,
                                   hjust = 1)) +
  labs(x="Political Party",
       title="2010 elections",
       y="Number of votes") -> p2

grid.arrange(p1, p2, ncol=1)

  • In 2010 the PMDB party takes first place from the PT.
  • The PSDB maintains its position in third place.
eleicoes_data %>%
  ggplot(aes(total_receita)) +
  geom_histogram(bins = 30) +
  labs(x="Total Revenue",
       y="Absolute Frequency") +
  facet_grid(. ~ ano) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

  • We can see a positive skew where small values are very frequent
  • This indicates that for this predictor: \(mode < median < mean\)
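The \(mode < median < mean\) ordering under positive skew can be checked on a small made-up sample, taking the mode as the most frequent value:

```r
# Toy positively skewed sample: most values are small, a few are large.
x <- c(1, 1, 1, 2, 2, 3, 5, 9, 20, 60)
mode_x   <- as.numeric(names(which.max(table(x))))  # most frequent value: 1
median_x <- median(x)                               # 2.5
mean_x   <- mean(x)                                 # 10.4
mode_x < median_x && median_x < mean_x              # TRUE
```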
eleicoes_data %>%
  ggplot(aes(media_receita)) +
  geom_histogram(bins = 30) +
  labs(x="Mean Revenue",
       y="Absolute Frequency") +
  facet_grid(. ~ ano)

  • We can see a positive skew where small values are very frequent
  • This indicates that for this predictor: \(mode < median < mean\)
eleicoes_data %>%
  ggplot(aes(total_despesa)) +
  geom_histogram(bins = 30) +
  labs(x="Total Expenditure",
       y="Absolute Frequency") +
  facet_grid(. ~ ano)

  • We can see a positive skew where small values are very frequent
  • This indicates that for this predictor: \(mode < median < mean\)
eleicoes_data %>%
  ggplot(aes(media_despesa)) +
  geom_histogram(bins = 30) +
  labs(x="Mean Expenditure",
       y="Absolute Frequency") +
  facet_grid(. ~ ano)

  • We can see a positive skew where small values are very frequent
  • This indicates that for this predictor: \(mode < median < mean\)
eleicoes_data %>%
  ggplot(aes(recursos_proprios)) +
  geom_histogram(bins = 30) +
  labs(x="Own Resources",
       y="Absolute Frequency") +
  facet_grid(. ~ ano)

  • We can see a positive skew where small values are very frequent
  • This indicates that for this predictor: \(mode < median < mean\)
eleicoes_data %>%
  mutate(ano = as.factor(ano)) %>%
  group_by(estado_civil, ano) %>%
  summarize(n = n()) %>%
  ggplot(aes(reorder(estado_civil,n), n,
             fill= ano)) +
  geom_bar(stat = "identity",
           position = position_dodge(width = 0.5)) + 
  labs(x="Marital status of candidate", 
       y="Absolute Frequency") +
  guides(fill = guide_legend(title = "year")) +
  coord_flip()

2010 tops 2006 in the number of candidates for each and every marital status, with one exception:

  • We see a decrease in SEPARADO(A) JUDICIALMENTE candidates from 2006 to 2010.
eleicoes_data %>%
  mutate(ano = as.factor(ano)) %>%
  group_by(grau, ano) %>%
  summarize(n = n()) %>%
  ggplot(aes(reorder(grau,n), n,
             fill= ano)) +
  geom_bar(stat = "identity",
           position = position_dodge(width = 0.5)) + 
  labs(x="Education level", 
       y="Absolute Frequency") +
  guides(fill = guide_legend(title = "year")) +
  coord_flip()

2010 tops 2006 in the number of candidates for each and every education level, with one exception:

  • The number of candidates with ENSINO FUNDAMENTAL INCOMPLETO decreased.
eleicoes_data %>%
  group_by(sexo, ano) %>%
  summarize(n = n()) %>%
  ggplot(aes(reorder(sexo,n), n)) +
  geom_bar(stat = "identity") + 
  labs(x="Gender", 
       y="Absolute Frequency") +
  facet_grid(. ~ano)

  • Male predominance is maintained across elections.


Correlogram


We’ll use the correlogram to get an idea of how the predictors interact with each other and with the target variable. How the predictors relate to the target votos is of particular interest.

  • The log-transformed predictors will be left out, as they are monotone transformations of predictors already in the plot and would add little new information.
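Under the hood, the correlogram is built from pairwise Pearson correlations, which base R’s cor() computes directly. A minimal sketch on synthetic data (the variable names here are made up, not columns of the dataset):

```r
# Minimal sketch: the correlogram cells are pairwise Pearson correlations.
set.seed(1)
spend <- runif(100, 0, 1e5)                # hypothetical spending values
votes <- 2 * spend + rnorm(100, sd = 1e4)  # votes roughly linear in spending
cor(spend, votes, use = "complete.obs")    # close to 1 for this toy data
```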
eleicoes_data %>% 
  filter(ano == 2006) %>%
  select(-partido,
         -uf,-nome,
         -estado_civil,
         -ocupacao,-ano,
         -cargo,-grau,-sexo,
         -log.quantidade_doacoes,
         -log.quantidade_doadores,
         -log.quantidade_despesas,
         -log.quantidade_fornecedores) %>%
  na.omit() %>%
  ggcorr(palette = "RdBu",
         color = "grey50",
         label = TRUE, hjust = 1,
         label_size = 3, size = 4,
         nbreaks = 5, layout.exp = 7) +
  ggtitle("Correlation plot for 2006 elections")

  • sequencial_candidato and id do not look meaningful (as expected).
  • total_receita and total_despesa seem very meaningful (especially total_despesa).
eleicoes_data %>% 
  filter(ano == 2010) %>%
  select(-partido,
         -uf,-nome,
         -estado_civil,
         -ocupacao,-ano,
         -cargo,-grau,-sexo,
         -log.quantidade_doacoes,
         -log.quantidade_doadores,
         -log.quantidade_despesas,
         -log.quantidade_fornecedores) %>%
  na.omit() %>%
  ggcorr(palette = "RdBu",
         color = "grey50",
         label = TRUE, hjust = 1,
         label_size = 3, size = 4,
         nbreaks = 5, layout.exp = 7) +
  ggtitle("Correlation plot for 2010 elections")

  • sequencial_candidato and id do not look meaningful (as expected).
  • total_receita and total_despesa seem very meaningful.
eleicoes_data %>% 
  select(-partido,
         -uf,-nome,
         -estado_civil,
         -ocupacao,-ano,
         -cargo,-grau,-sexo,
         -log.quantidade_doacoes,
         -log.quantidade_doadores,
         -log.quantidade_despesas,
         -log.quantidade_fornecedores) %>%
  na.omit() %>%
  ggcorr(palette = "RdBu",
         color = "grey50",
         label = TRUE, hjust = 1,
         label_size = 3, size = 4,
         nbreaks = 5, layout.exp = 7) +
  ggtitle("Correlation plot for both elections")

  • sequencial_candidato and id do not look meaningful (as expected).
  • total_receita and total_despesa seem very meaningful (especially total_despesa).


Should we use all variables?

A multivariate linear regression model built with all variables wouldn’t be feasible:

  • cargo is categorical and renders a one-level factor (it is constant across observations). For this reason we must refrain from using it in our regression.
  • sequencial_candidato and id aren’t meaningful at all as the correlogram clearly showed us.
  • We have nominal categorical variables, such as uf, with tens of levels. Such a variable demands one-hot encoding, which would greatly increase the number of variables fed to the model and would most likely lead to overfitting.

For the aforementioned reasons, a multivariate linear regression model using all variables isn’t viable.
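A minimal base-R sketch of the one-hot blow-up: model.matrix() turns a single factor into one indicator column per level, so the 27-level uf alone would contribute 27 columns to the model.

```r
# One-hot encoding a factor: one indicator column per level.
uf <- factor(c("SP", "RJ", "AP", "MS", "RO"))  # 5 of the 27 Brazilian UFs
dummies <- model.matrix(~ uf - 1)              # "-1" drops the intercept,
                                               # keeping all level columns
ncol(dummies)                                  # 5 columns from 1 variable
```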




Model for 2006 elections vs Model for 2010 elections

Linear Model for 2006 elections


Split Data for Cross Validation


  • Let’s use the \(50/25/25\) proportions for the train, validate, and test datasets
scaled_data %>%  
   filter(ano == 2006) -> scaled_data_2006

scaled_data_2006 %>%
  sample_n(5)
set.seed(11) # Set the seed for reproducibility

scaled_data_2006 %>% 
  dplyr::sample_frac(.5) -> train_data_2006

encoding <- build_encoding(dataSet = train_data_2006,
                           cols = c("uf","sexo","grau",
                                    "partido","estado_civil"),
                           verbose = F)

train_data_2006 <- one_hot_encoder(dataSet = train_data_2006,
                           encoding = encoding,
                           drop = TRUE,
                           verbose = F)

cat("#### Train Data ",
    "\n##### Observations: ",nrow(train_data_2006),
    "\n##### Variables: ",ncol(train_data_2006))

Train Data

Observations: 1718
Variables: 93


set.seed(11) # Set the seed for reproducibility

dplyr::anti_join(scaled_data_2006, 
                 train_data_2006, 
                 by = 'id') -> intermediate_data

intermediate_data %>% 
  dplyr::sample_frac(.5) -> test_data_2006

test_data_2006 <- one_hot_encoder(dataSet = test_data_2006,
                           encoding = encoding,
                           drop = TRUE,
                           verbose = F)

cat("#### Test Data ",
    "\n##### Observations: ",nrow(test_data_2006),
    "\n##### Variables: ",ncol(test_data_2006))

Test Data

Observations: 859
Variables: 93
set.seed(11) # Set the seed for reproducibility

dplyr::anti_join(intermediate_data, 
                 test_data_2006, 
                 by = 'id') -> validate_data_2006


validate_data_2006 <- one_hot_encoder(dataSet = validate_data_2006,
                           encoding = encoding,
                           drop = TRUE,
                           verbose = F)
rm(intermediate_data)

cat("#### Validate Data ",
    "\n##### Observations: ",nrow(validate_data_2006),
    "\n##### Variables: ",ncol(validate_data_2006))

Validate Data

Observations: 859
Variables: 93


Train the model

mod_2006 <- lm(votos ~ quantidade_fornecedores * partido.PSDB + media_despesa * partido.PMDB +
                  total_receita * total_despesa + uf.SP * estado.civil.CASADO.A. + 
                  uf.RJ * total_despesa + total_receita * `grau.SUPERIOR COMPLETO`,
          data = train_data_2006)

broom::glance(mod_2006)
  • As these quality metrics come from the training data, they aren’t that meaningful
    • A decent \(R^2\) and \(adjusted \thinspace R^2\).
    • The \(F \thinspace statistic\) (the statistic column) is far greater than 1 (a good sign).
broom::tidy(mod_2006,
            conf.int = TRUE,
            conf.level = 0.95)


Linear Model for 2010 elections


Split Data for Cross Validation


  • Let’s use the \(50/25/25\) proportions for the train, validate, and test datasets
scaled_data %>%
  filter(ano == 2010) -> scaled_data_2010

scaled_data_2010 %>%
  sample_n(5)
set.seed(11) # Set the seed for reproducibility

## Adding surrogate key to dataframe
scaled_data_2010$id <- 1:nrow(scaled_data_2010)

scaled_data_2010 %>% 
  dplyr::sample_frac(.5) -> train_data_2010

encoding <- build_encoding(dataSet = train_data_2010,
                           cols = c("uf","sexo","grau",
                                    "partido","estado_civil"),
                           verbose = F)

train_data_2010 <- one_hot_encoder(dataSet = train_data_2010,
                           encoding = encoding,
                           drop = TRUE,
                           verbose = F)

cat("#### Train Data ",
    "\n##### Observations: ",nrow(train_data_2010),
    "\n##### Variables: ",ncol(train_data_2010))

Train Data

Observations: 2020
Variables: 92


set.seed(11) # Set the seed for reproducibility

dplyr::anti_join(scaled_data_2010, 
                 train_data_2010, 
                 by = 'id') -> intermediate_data

intermediate_data %>% 
  dplyr::sample_frac(.5) -> test_data_2010

test_data_2010 <- one_hot_encoder(dataSet = test_data_2010,
                           encoding = encoding,
                           drop = TRUE,
                           verbose = F)

cat("#### Test Data ",
    "\n##### Observations: ",nrow(test_data_2010),
    "\n##### Variables: ",ncol(test_data_2010))

Test Data

Observations: 1010
Variables: 92
set.seed(11) # Set the seed for reproducibility

dplyr::anti_join(intermediate_data, 
                 test_data_2010, 
                 by = 'id') -> validate_data_2010

validate_data_2010 <- one_hot_encoder(dataSet = validate_data_2010,
                           encoding = encoding,
                           drop = TRUE,
                           verbose = F)

rm(intermediate_data)

cat("#### Validate Data ",
    "\n##### Observations: ",nrow(validate_data_2010),
    "\n##### Variables: ",ncol(validate_data_2010))

Validate Data

Observations: 1010
Variables: 92


Train the model

mod_2010 <- lm(votos ~ quantidade_fornecedores * partido.PSDB + media_despesa * partido.PMDB +
                  total_receita * total_despesa + uf.SP * estado.civil.CASADO.A. + 
                  uf.RJ * total_despesa + total_receita * `grau.SUPERIOR COMPLETO`,
          data = train_data_2010)

broom::glance(mod_2010)
  • As these quality metrics come from the training data, they aren’t that meaningful
    • The \(R^2\) and \(adjusted \thinspace R^2\) are borderline disappointing.
    • The \(F \thinspace statistic\) (the statistic column) is far greater than 1 (a good sign).
broom::tidy(mod_2010,
            conf.int = TRUE,
            conf.level = 0.95)


Evaluating Models

Underperforming predictors

broom::tidy(mod_2006, 
     conf.int = TRUE, 
     conf.level = 0.95,
     sep=":") %>%
  arrange(desc(p.value)) %>%
  slice(1:3) %>%
  ggplot(aes(reorder(term,p.value), p.value)) +
  geom_point(size = 2) +
  labs(x = "Predictor variable",
       y = "Estimated p-value",
       title="Underperforming predictors (2006 elections)")

  • Unexpectedly, total_receita didn’t perform so well in terms of \(p-value\).
  • The predictors created out of the encoding of the nominal categorical variable partido didn’t perform well.
broom::tidy(mod_2010, 
     conf.int = TRUE, 
     conf.level = 0.95,
     sep=":") %>%
  arrange(desc(p.value)) %>%
  slice(1:3) %>%
  ggplot(aes(reorder(term,p.value), p.value)) +
  geom_point(size = 2) +
  labs(x = "Predictor variable",
       y = "Estimated p-value",
       title="Underperforming predictors (2010 elections)")

  • The predictors created out of the encoding of the nominal categorical variable partido didn’t perform well.
  • The predictors created out of the encoding of the nominal categorical variable estado.civil didn’t perform well.
  • The predictor quantidade_fornecedores didn’t perform well.


In both models we can see that predictors related to partido perform poorly. Categorical variables in general also feature prominently among the underperforming predictors.

Overperforming predictors

  • We shall keep out the Intercept as it isn’t as meaningful as other predictors.
broom::tidy(mod_2006, 
     conf.int = TRUE, 
     conf.level = 0.95,
     sep=":") %>%
  filter(term != "(Intercept)") %>%
  arrange(p.value) %>%
  slice(1:3) %>%
  ggplot(aes(reorder(term,p.value), p.value)) +
  geom_hline(yintercept = 0.05, colour = "darkred") +
  geom_point(size = 2) +
  labs(x = "Predictor variable",
       y = "Estimated p-value",
       title="Overperforming predictors (2006 elections)")

  • Different combinations of the predictors total_receita and total_despesa appear as the best predictors.
broom::tidy(mod_2010, 
     conf.int = TRUE, 
     conf.level = 0.95,
     sep=":") %>%
  filter(term != "(Intercept)") %>%
  arrange(p.value) %>%
  slice(1:3) %>%
  ggplot(aes(reorder(term,p.value), p.value)) +
  geom_hline(yintercept = 0.05, colour = "darkred") +
  geom_point(size = 2) +
  labs(x = "Predictor variable",
       y = "Estimated p-value",
       title="Overperforming predictors (2010 elections)")

  • Different combinations of the predictors total_receita and total_despesa appear as the best predictors.


In both models, different combinations of the predictors total_receita and total_despesa were clearly the best predictors (those that explain the votes the most).

Residue Analysis


Residual vs Fitted

mod_2006 %>%
  ggplot(aes(.fitted, .resid)) + 
  geom_point() +
  stat_smooth(method="loess") + 
  geom_hline(col="red",
             yintercept=0,
             linetype="dashed") + 
  labs(y="Residuals",
       x="Fitted values",
       title="Residual vs Fitted Plot (2006 elections)")

  • There’s an indication, though not a strong one, of a pattern in the residuals vs fitted values; this is a bad sign for the analyzed model.
mod_2010 %>%
  ggplot(aes(.fitted, .resid)) + 
  geom_point() +
  stat_smooth(method="loess") + 
  geom_hline(col="red",
             yintercept=0,
             linetype="dashed") + 
  labs(y="Residuals",
       x="Fitted values",
       title="Residual vs Fitted Plot (2010 elections)")

  • Here we can see fairly equally spread residuals (they could be better) around the horizontal line, without distinct patterns. A good sign for the model.

Normal Q-Q

mod_2006 %>%
  ggplot(aes(sample=rstandard(.))) +
  stat_qq(na.rm = TRUE,
          shape=1,size=3) +      # open circles
  labs(title="Normal Q-Q (2006 elections)",        # plot title
  x="Theoretical Quantiles",      # x-axis label
  y="Standardized Residuals") +   # y-axis label
  geom_abline(color = "red",
              size = 0.8,
              linetype="dashed")  # dashed reference line

  • Our residuals somewhat deviate from the reference line (i.e., they are not quite normally distributed). As the deviation is relatively mild, this is only a mildly negative sign for our model.
mod_2010 %>%
  ggplot(aes(sample=rstandard(.))) +
  stat_qq(na.rm = TRUE,
          shape=1,size=3) +      # open circles
  labs(title="Normal Q-Q (2010 elections)",        # plot title
  x="Theoretical Quantiles",      # x-axis label
  y="Standardized Residuals") +   # y-axis label
  geom_abline(color = "red",
              size = 0.8,
              linetype="dashed")  # dashed reference line

  • Although not perfect we can see that the residuals for this model are closer to normally distributed.

Scale-Location

mod_2006 %>%
  ggplot(aes(.fitted, 
             sqrt(abs(.stdresid)))) + 
  geom_point(na.rm=TRUE) + 
  stat_smooth(method="loess",
              na.rm = TRUE) +
  labs(title = "Scale-Location (2006 elections)",
       x= "Fitted Value",
       y = expression(sqrt("|Standardized residuals|")))

  • The residuals are not equally spread: our model violates the assumption of \(homoscedasticity\) (equal variance).
    • The violation isn’t extensive; from fitted values of roughly 25000 onward, the residuals seem fairly equally spread.
mod_2010 %>%
  ggplot(aes(.fitted, 
             sqrt(abs(.stdresid)))) + 
  geom_point(na.rm=TRUE) + 
  stat_smooth(method="loess",
              na.rm = TRUE) +
  labs(title = "Scale-Location (2010 elections)",
       x= "Fitted Value",
       y = expression(sqrt("|Standardized residuals|")))

  • Here we see a scenario similar to the one we just witnessed, but in the case of this model the residuals are closer to equally spread. This is the sign of a better fit.


Residual vs Leverage Plot


  • We will use the rule of thumb \(D_i > 1\) as the cutoff for highly influential points/outliers
mod_2006 %>%
  ggplot(aes(.hat, .stdresid)) + 
  geom_point(aes(size=.cooksd), na.rm=TRUE) +
  stat_smooth(method="loess", na.rm=TRUE) +
  xlab("Leverage")+ylab("Standardized Residuals") + 
  ggtitle("Residual vs Leverage Plot (2006 elections)") + 
  scale_size_continuous("Cook's Distance", range=c(1,5)) +    
  theme(legend.position="bottom")

  • We can see no outliers that would drastically change the results of our model.
mod_2010 %>%
  ggplot(aes(.hat, .stdresid)) + 
  geom_point(aes(size=.cooksd), na.rm=TRUE) +
  stat_smooth(method="loess", na.rm=TRUE) +
  xlab("Leverage")+ylab("Standardized Residuals") + 
  ggtitle("Residual vs Leverage Plot (2010 elections)") + 
  scale_size_continuous("Cook's Distance", range=c(1,5)) +    
  theme(legend.position="bottom")

  • Once again we can see no outliers that would drastically change the results of our model.
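The \(D_i > 1\) rule can also be checked numerically with base R’s cooks.distance(); below is a minimal sketch on a toy lm() fit (the same call would apply to mod_2006 and mod_2010):

```r
# Count influential points by the D_i > 1 rule of thumb on a toy fit.
set.seed(11)
toy <- data.frame(x = rnorm(100))
toy$y <- 3 * toy$x + rnorm(100)      # clean linear relationship
fit <- lm(y ~ x, data = toy)
d <- cooks.distance(fit)             # one D_i per observation
sum(d > 1)                           # 0 for this well-behaved toy data
```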

Cook’s dist vs Leverage


  • We will use the rule of thumb \(D_i > 1\) as the cutoff for highly influential points/outliers
mod_2006 %>%
  ggplot(aes(.hat, .cooksd)) + 
  geom_point(na.rm=TRUE) + 
  stat_smooth(method="loess", na.rm=TRUE) + 
  xlab("Leverage hii")+ylab("Cook's Distance") + 
  ggtitle("Cook's dist vs Leverage hii/(1-hii) (2006 elections)") + 
  geom_abline(slope=seq(0,3,0.5), color="gray", linetype="dashed")

  • We can see no outliers that would drastically change the results of our model.
mod_2010 %>%
  ggplot(aes(.hat, .cooksd)) + 
  geom_point(na.rm=TRUE) + 
  stat_smooth(method="loess", na.rm=TRUE) + 
  xlab("Leverage hii")+ylab("Cook's Distance") + 
  ggtitle("Cook's dist vs Leverage hii/(1-hii) (2010 elections)") + 
  geom_abline(slope=seq(0,3,0.5), color="gray", linetype="dashed")

  • Once again we can see no outliers that would drastically change the results of our model.

Cross Validation

predictions <- mod_2006 %>% predict(validate_data_2006)

data.frame( R2 = caret::R2(predictions, validate_data_2006$votos),
            RMSE = caret::RMSE(predictions, validate_data_2006$votos),
            MAE = caret::MAE(predictions, validate_data_2006$votos),
            ERR = caret::RMSE(predictions, validate_data_2006$votos)/
              mean(validate_data_2006$votos))

Now let’s talk about the results from the validation data for the 2006 elections (more meaningful).

  • We got a decent \(R^2\) of approximately 0.6 (notice the decrease relative to training). This means the model explains approximately 60% of the response variable’s variability.
  • The average difference between the observed outcome values and the values predicted by the model (RMSE) was approximately 29326.09; in other words, our predictions typically miss the true vote count by about 29326 votes in either direction.
  • The average absolute difference between observed and predicted outcomes (MAE) was approximately 13022.77.
  • The prediction error rate (ERR, the RMSE divided by the mean outcome) was 1.263851.
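For reference, these metrics reduce to simple formulas; here is a minimal sketch on made-up observed/predicted vectors (caret’s RMSE() and MAE() implement the same computations):

```r
# RMSE, MAE and ERR computed by hand on hypothetical vectors.
obs  <- c(100, 250, 400, 900)        # hypothetical observed vote counts
pred <- c(120, 230, 350, 1000)       # hypothetical model predictions
rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error
mae  <- mean(abs(pred - obs))        # mean absolute error: 47.5
err  <- rmse / mean(obs)             # error rate relative to the mean outcome
```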

predictions <- mod_2010 %>% predict(validate_data_2010)

data.frame( R2 = caret::R2(predictions, validate_data_2010$votos),
            RMSE = caret::RMSE(predictions, validate_data_2010$votos),
            MAE = caret::MAE(predictions, validate_data_2010$votos),
            ERR = caret::RMSE(predictions, validate_data_2010$votos)/
              mean(validate_data_2010$votos))

Now let’s talk about the results from the validation data for the 2010 elections (more meaningful).

  • We got a decent \(R^2\) of approximately 0.54 (notice the decrease relative to training). This means the model explains approximately 54% of the response variable’s variability (smaller than for the previous model).
  • The average difference between the observed outcome values and the values predicted by the model (RMSE) was approximately 30501.16; in other words, our predictions typically miss the true vote count by about 30501 votes in either direction.
  • The average absolute difference between observed and predicted outcomes (MAE) was approximately 13789.98.
  • The prediction error rate (ERR, the RMSE divided by the mean outcome) was 1.409471.

There are signs here that the model for the 2006 elections fit the training set a little too closely (better residual diagnostics but worse validation results).

predictions <- mod_2006 %>% predict(test_data_2006)

data.frame( R2 = caret::R2(predictions, test_data_2006$votos),
            RMSE = caret::RMSE(predictions, test_data_2006$votos),
            MAE = caret::MAE(predictions, test_data_2006$votos),
            ERR = caret::RMSE(predictions, test_data_2006$votos)/
              mean(test_data_2006$votos))

Now let’s talk about the results from the test data for the 2006 elections (most meaningful).

  • We got a decent \(R^2\) of approximately 0.5 (notice the decrease). This means the model explains approximately 50% of the response variable’s variability (smaller than on the validation data).
  • The average difference between the observed outcome values and the values predicted by the model (RMSE) was approximately 31605.85; in other words, our predictions typically miss the true vote count by about 31606 votes in either direction.
  • The average absolute difference between observed and predicted outcomes (MAE) was approximately 13832.72.
  • The prediction error rate (ERR, the RMSE divided by the mean outcome) was 1.350674.

predictions <- mod_2010 %>% predict(test_data_2010)

data.frame( R2 = caret::R2(predictions, test_data_2010$votos),
            RMSE = caret::RMSE(predictions, test_data_2010$votos),
            MAE = caret::MAE(predictions, test_data_2010$votos),
            ERR = caret::RMSE(predictions, test_data_2010$votos)/
              mean(test_data_2010$votos))

Now let’s talk about the results from the test data for the 2010 elections (most meaningful).

  • We got a decent \(R^2\) of approximately 0.53 (notice the decrease). This means the model explains approximately 53% of the response variable’s variability.
  • The average difference between the observed outcome values and the values predicted by the model (RMSE) was approximately 29033.86; in other words, our predictions typically miss the true vote count by about 29034 votes in either direction.
  • The average absolute difference between observed and predicted outcomes (MAE) was approximately 12804.09.
  • The prediction error rate (ERR, the RMSE divided by the mean outcome) was 1.461269.

In the end, the model proved a slightly better fit for the 2010 elections than for the 2006 elections.

Removing redundant predictors

total_receita and total_despesa were clearly the best predictors (those that could explain the votes the most); the rest of the predictors, such as partido, estado_civil and quantidade_fornecedores, were redundant. For this reason, we’ll try to create a model using only the best predictors.
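One way to back such a choice of predictors is to inspect coefficient p-values: predictors whose coefficients are statistically indistinguishable from zero add little. A hedged sketch on simulated data (all variable names here are synthetic stand-ins, not the election dataset):

```r
set.seed(1)
n <- 200
receita <- rexp(n, rate = 1e-5)           # synthetic "revenue"
despesa <- receita * runif(n, 0.8, 1.2)   # "spending", correlated with revenue
ruido   <- rnorm(n)                       # an irrelevant predictor
votos   <- 0.01 * receita + 0.005 * despesa + rnorm(n, sd = 50)

fit <- lm(votos ~ receita + despesa + ruido)

# Columns: Estimate, Std. Error, t value, Pr(>|t|);
# large p-values flag candidates for removal
summary(fit)$coefficients
```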

2006 elections skimmed model

mod_2006 <- lm(votos ~ total_receita * total_despesa,
          data = train_data_2006)

broom::glance(mod_2006)
  • As the quality metrics come from training, they aren’t that meaningful:
    • A decent \(R^2\) and \(adjusted \thinspace R^2\), very close to the original model’s.
    • The \(F \thinspace statistic\) (the statistic column) is way bigger than 1 (a good sign) and better than the one from the original model.
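Note that `*` in an R formula denotes main effects plus their interaction; a minimal sketch showing the equivalence on a synthetic data frame (df is a stand-in, not the election data):

```r
set.seed(42)
df <- data.frame(x = runif(20), z = runif(20))
df$y <- 1 + 2 * df$x + 3 * df$z + 4 * df$x * df$z + rnorm(20, sd = 0.1)

m_star <- lm(y ~ x * z, data = df)        # shorthand
m_full <- lm(y ~ x + z + x:z, data = df)  # explicit expansion

all.equal(coef(m_star), coef(m_full))     # the two fits are identical
```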

Residual Analysis

mod_2006 %>%
  ggplot(aes(.fitted, .resid)) + 
  geom_point() +
  stat_smooth(method="loess") + 
  geom_hline(col="red",
             yintercept=0,
             linetype="dashed") + 
  labs(y="Residuals",
       x="Fitted values",
       title="Residual vs Fitted Plot (2006 elections)")

  • There’s an indication, although not a strong one, of a pattern in the residuals vs. fitted plot; this is a bad sign for the analyzed model.
    • The pattern is stronger than the one from the original model.
mod_2006 %>%
  ggplot(aes(sample=rstandard(.))) +
  stat_qq(na.rm = TRUE,
          shape=1,size=3) +      # open circles
  labs(title="Normal Q-Q (2006 elections)",        # plot title
  x="Theoretical Quantiles",      # x-axis label
  y="Standardized Residuals") +   # y-axis label
  geom_abline(color = "red",
              size = 0.8,
              linetype="dashed")  # dashed reference line

  • Our residuals somewhat deviate from a normal distribution (they are not normally distributed). As the deviation is relatively mild, this is a negative yet mild sign for our model.
    • Results similar to the original model.
mod_2006 %>%
  ggplot(aes(.fitted, 
             sqrt(abs(.stdresid)))) + 
  geom_point(na.rm=TRUE) + 
  stat_smooth(method="loess",
              na.rm = TRUE) +
  labs(title = "Scale-Location (2006 elections)",
       x= "Fitted Value",
       y = expression(sqrt("|Standardized residuals|")))

  • The residuals are not equally spread. Our model has violated the assumption of \(homoscedasticity\) (equal variance).
    • The violation is slightly stronger here.
mod_2006 %>%
  ggplot(aes(.hat, .stdresid)) + 
  geom_point(aes(size=.cooksd), na.rm=TRUE) +
  stat_smooth(method="loess", na.rm=TRUE) +
  xlab("Leverage")+ylab("Standardized Residuals") + 
  ggtitle("Residual vs Leverage Plot (2006 elections)") + 
  scale_size_continuous("Cook's Distance", range=c(1,5)) +    
  theme(legend.position="bottom")

  • We can see no outliers that would drastically change the results of our model.
mod_2006 %>%
  ggplot(aes(.hat, .cooksd)) + 
  geom_point(na.rm=TRUE) + 
  stat_smooth(method="loess", na.rm=TRUE) + 
  xlab("Leverage hii")+ylab("Cook's Distance") + 
  ggtitle("Cook's dist vs Leverage hii/(1-hii) (2006 elections)") + 
  geom_abline(slope=seq(0,3,0.5), color="gray", linetype="dashed")

  • Here we see an indication of a somewhat more influential data point (still within an acceptable range).
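The leverage and Cook’s distance values plotted above can also be extracted directly with base R’s accessors; a sketch on the built-in cars dataset as a stand-in for the election model:

```r
fit <- lm(dist ~ speed, data = cars)   # built-in dataset as a stand-in

lev  <- hatvalues(fit)        # leverage h_ii per observation
cook <- cooks.distance(fit)   # Cook's distance per observation

# A common rule of thumb flags observations with Cook's distance > 4/n
influential <- which(cook > 4 / nrow(cars))
influential
```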

Cross Validation

predictions <- mod_2006 %>% predict(validate_data_2006)

data.frame( R2 = caret::R2(predictions, validate_data_2006$votos),
            RMSE = caret::RMSE(predictions, validate_data_2006$votos),
            MAE = caret::MAE(predictions, validate_data_2006$votos),
            ERR = caret::RMSE(predictions, validate_data_2006$votos)/
              mean(validate_data_2006$votos))

Now let’s talk about the results taken from the validate data (more meaningful).

  • We got a decent R² and adjusted R² of approximately 0.58. This means that this model explains approximately 58% of the response variable variability.
  • The average difference between the observed known outcome values and the values predicted by the model (RMSE) was approximately 30180.23.
    • Our model would miss the mark by approximately 30180 (RMSE); that is, if a candidate had one million votes, we would predict 30180 more (or fewer) votes than we should.
  • The average absolute difference between observed and predicted outcomes (MAE) was approximately 13105.71.
  • The prediction error rate (ERR) was 1.300661.
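The single train/validate/test split used here could also be complemented with k-fold cross-validation, which averages the error over several held-out folds; a base-R sketch on the built-in cars dataset (a stand-in for the election data):

```r
set.seed(11)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(cars)))  # assign each row a fold

rmse_per_fold <- sapply(1:k, function(i) {
  fit  <- lm(dist ~ speed, data = cars[folds != i, ])  # train on k-1 folds
  pred <- predict(fit, newdata = cars[folds == i, ])   # predict held-out fold
  sqrt(mean((cars$dist[folds == i] - pred)^2))
})

mean(rmse_per_fold)   # cross-validated RMSE estimate
```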
predictions <- mod_2006 %>% predict(test_data_2006)

data.frame( R2 = caret::R2(predictions, test_data_2006$votos),
            RMSE = caret::RMSE(predictions, test_data_2006$votos),
            MAE = caret::MAE(predictions, test_data_2006$votos),
            ERR = caret::RMSE(predictions, test_data_2006$votos)/
              mean(test_data_2006$votos))

Now let’s talk about the results taken from the test data (most meaningful).

  • We got a decent R² and adjusted R² of approximately 0.52. This means that this model explains approximately 52% of the response variable variability.
  • The average difference between the observed known outcome values and the values predicted by the model (RMSE) was approximately 30919.08.
    • Our model would miss the mark by approximately 30919 (RMSE); that is, if a candidate had one million votes, we would predict 30919 more (or fewer) votes than we should.
  • The average absolute difference between observed and predicted outcomes (MAE) was approximately 13605.23.
  • The prediction error rate (ERR) was 1.321325.

Our skimmed model fared worse on validation but had a slightly better result in test than the original model.

2010 elections skimmed model

mod_2010 <- lm(votos ~ total_receita * total_despesa,
          data = train_data_2010)

broom::glance(mod_2010)
  • As the quality metrics come from training, they aren’t that meaningful:
    • A decent \(R^2\) and \(adjusted \thinspace R^2\), very close to the original model’s.
    • The \(F \thinspace statistic\) (the statistic column) is way bigger than 1 (a good sign) and better than the one from the original model.

Residual Analysis


mod_2010 %>%
  ggplot(aes(.fitted, .resid)) + 
  geom_point() +
  stat_smooth(method="loess") + 
  geom_hline(col="red",
             yintercept=0,
             linetype="dashed") + 
  labs(y="Residuals",
       x="Fitted values",
       title="Residual vs Fitted Plot (2010 elections)")

  • Here we can see fairly equally spread residuals (they could be better) around a horizontal line, without distinct patterns; a good sign for the model.
    • Results slightly worse than the ones from the original model.
mod_2010 %>%
  ggplot(aes(sample=rstandard(.))) +
  stat_qq(na.rm = TRUE,
          shape=1,size=3) +      # open circles
  labs(title="Normal Q-Q (2010 elections)",        # plot title
  x="Theoretical Quantiles",      # x-axis label
  y="Standardized Residuals") +   # y-axis label
  geom_abline(color = "red",
              size = 0.8,
              linetype="dashed")  # dashed reference line

  • Although not perfect, we can see that the residuals for this model are closer to normally distributed.
    • Results slightly worse than the ones from the original model.
mod_2010 %>%
  ggplot(aes(.fitted, 
             sqrt(abs(.stdresid)))) + 
  geom_point(na.rm=TRUE) + 
  stat_smooth(method="loess",
              na.rm = TRUE) +
  labs(title = "Scale-Location (2010 elections)",
       x= "Fitted Value",
       y = expression(sqrt("|Standardized residuals|")))

  • Here we see a scenario similar to the one we just witnessed, but for this model the residuals are closer to equally spread.
    • Results somewhat better than the ones from the original model.
mod_2010 %>%
  ggplot(aes(.hat, .stdresid)) + 
  geom_point(aes(size=.cooksd), na.rm=TRUE) +
  stat_smooth(method="loess", na.rm=TRUE) +
  xlab("Leverage")+ylab("Standardized Residuals") + 
  ggtitle("Residual vs Leverage Plot (2010 elections)") + 
  scale_size_continuous("Cook's Distance", range=c(1,5)) +    
  theme(legend.position="bottom")

  • We can see no outliers that would drastically change the results of our model.
mod_2010 %>%
  ggplot(aes(.hat, .cooksd)) + 
  geom_point(na.rm=TRUE) + 
  stat_smooth(method="loess", na.rm=TRUE) + 
  xlab("Leverage hii")+ylab("Cook's Distance") + 
  ggtitle("Cook's dist vs Leverage hii/(1-hii) (2010 elections)") + 
  geom_abline(slope=seq(0,3,0.5), color="gray", linetype="dashed")

  • We can see no outliers that would drastically change the results of our model.

Cross Validation

predictions <- mod_2010 %>% predict(validate_data_2010)

data.frame( R2 = caret::R2(predictions, validate_data_2010$votos),
            RMSE = caret::RMSE(predictions, validate_data_2010$votos),
            MAE = caret::MAE(predictions, validate_data_2010$votos),
            ERR = caret::RMSE(predictions, validate_data_2010$votos)/
              mean(validate_data_2010$votos))

Now let’s talk about the results taken from the validate data (more meaningful).

  • We got a decent R² and adjusted R² of approximately 0.62. This means that this model explains approximately 62% of the response variable variability.
  • The average difference between the observed known outcome values and the values predicted by the model (RMSE) was approximately 27648.34.
    • Our model would miss the mark by approximately 27648 (RMSE); that is, if a candidate had one million votes, we would predict 27648 more (or fewer) votes than we should.
  • The average absolute difference between observed and predicted outcomes (MAE) was approximately 13045.81.
  • The prediction error rate (ERR) was 1.277641.
predictions <- mod_2010 %>% predict(test_data_2010)

data.frame( R2 = caret::R2(predictions, test_data_2010$votos),
            RMSE = caret::RMSE(predictions, test_data_2010$votos),
            MAE = caret::MAE(predictions, test_data_2010$votos),
            ERR = caret::RMSE(predictions, test_data_2010$votos)/
              mean(test_data_2010$votos))

Now let’s talk about the results taken from the test data (most meaningful).

  • We got a decent R² and adjusted R² of approximately 0.57 (notice the decrease). This means that this model explains approximately 57% of the response variable variability.

  • The average difference between the observed known outcome values and the values predicted by the model (RMSE) was approximately 27420.97.
    • Our model would miss the mark by approximately 27421 (RMSE); that is, if a candidate had one million votes, we would predict 27421 more (or fewer) votes than we should.
  • The average absolute difference between observed and predicted outcomes (MAE) was approximately 12391.32.
  • The prediction error rate (ERR) was 1.380093.




Model for both elections


Split data for cross validation


  • Let’s use the \(50/25/25\) proportions for the train, validate, and test datasets
set.seed(11) # We set the seed for reproducibility

## Adding surrogate key to dataframe
scaled_data$id <- 1:nrow(scaled_data)

scaled_data %>% 
  dplyr::sample_frac(.5) -> train_data

encoding <- build_encoding(dataSet = train_data,
                           cols = c("uf","sexo","grau",
                                    "partido","estado_civil"),
                           verbose = F)

train_data <- one_hot_encoder(dataSet = train_data,
                           encoding = encoding,
                           drop = TRUE,
                           verbose = F)

cat("#### Train Data ",
    "\n##### Observations: ",nrow(train_data),
    "\n##### Variables: ",ncol(train_data))

Train Data

Observations: 3738
Variables: 96
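The build_encoding/one_hot_encoder step turns each categorical column into one indicator column per level, which is why the variable count jumps to 96. The same encoding can be illustrated with base R’s model.matrix on a toy factor:

```r
df <- data.frame(uf = factor(c("AP", "RO", "AP", "MS")))

# "~ uf - 1" drops the intercept so every level gets its own 0/1 column
one_hot <- model.matrix(~ uf - 1, data = df)
one_hot   # columns ufAP, ufMS, ufRO
```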


set.seed(11) # We set the seed for reproducibility

dplyr::anti_join(scaled_data, 
                 train_data, 
                 by = 'id') -> intermediate_data

intermediate_data %>% 
  dplyr::sample_frac(.5) -> test_data

test_data <- one_hot_encoder(dataSet = test_data,
                           encoding = encoding,
                           drop = TRUE,
                           verbose = F)

cat("#### Test Data ",
    "\n##### Observations: ",nrow(test_data),
    "\n##### Variables: ",ncol(test_data))

Test Data

Observations: 1869
Variables: 96
set.seed(11) # We set the seed for reproducibility

dplyr::anti_join(intermediate_data, 
                 test_data, 
                 by = 'id') -> validate_data

validate_data <- one_hot_encoder(dataSet = validate_data,
                           encoding = encoding,
                           drop = TRUE,
                           verbose = F)

rm(intermediate_data)

cat("#### Validate Data ",
    "\n##### Observations: ",nrow(validate_data),
    "\n##### Variables: ",ncol(validate_data))

Validate Data

Observations: 1869
Variables: 96
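The split above can be checked: anti_join on the surrogate key guarantees the three sets are pairwise disjoint and together cover the whole dataset. A base-R sketch of the same 50/25/25 scheme on a toy id column:

```r
set.seed(11)
ids <- 1:100                          # surrogate keys, as in the report

train_ids <- sample(ids, 50)          # 50% train
rest      <- setdiff(ids, train_ids)
test_ids  <- sample(rest, 25)         # 25% test
valid_ids <- setdiff(rest, test_ids)  # remaining 25% validate

# Pairwise disjoint, union covers every id
length(intersect(train_ids, test_ids))            # 0
setequal(c(train_ids, test_ids, valid_ids), ids)  # TRUE
```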
mod <- lm(votos ~ total_receita * total_despesa,
          data = train_data)

broom::glance(mod)
  • As the quality metrics come from training, they aren’t that meaningful:
    • The \(R^2\) and \(adjusted \thinspace R^2\) are decent.
    • The \(F \thinspace statistic\) (the statistic column) is way bigger than 1 (a good sign).

Residual Analysis


mod %>%
  ggplot(aes(.fitted, .resid)) + 
  geom_point() +
  stat_smooth(method="loess") + 
  geom_hline(col="red",
             yintercept=0,
             linetype="dashed") + 
  labs(y="Residuals",
       x="Fitted values",
       title="Residual vs Fitted Plot")

  • There’s considerable indication of a pattern in the residuals; this is a bad sign for the analyzed model.
mod %>%
  ggplot(aes(sample=rstandard(.))) +
  stat_qq(na.rm = TRUE,
          shape=1,size=3) +      # open circles
  labs(title="Normal Q-Q",        # plot title
  x="Theoretical Quantiles",      # x-axis label
  y="Standardized Residuals") +   # y-axis label
  geom_abline(color = "red",
              size = 0.8,
              linetype="dashed")  # dashed reference line

Our residuals deviate considerably from a normal distribution.

mod %>%
  ggplot(aes(.fitted, 
             sqrt(abs(.stdresid)))) + 
  geom_point(na.rm=TRUE) + 
  stat_smooth(method="loess",
              na.rm = TRUE) +
  labs(title = "Scale-Location",
       x= "Fitted Value",
       y = expression(sqrt("|Standardized residuals|")))

  • The residuals are not equally spread. Our model has violated the assumption of \(homoscedasticity\) (equal variance).
    • The violation is cause for worry.
mod %>%
  ggplot(aes(.hat, .stdresid)) + 
  geom_point(aes(size=.cooksd), na.rm=TRUE) +
  stat_smooth(method="loess", na.rm=TRUE) +
  xlab("Leverage")+ylab("Standardized Residuals") + 
  ggtitle("Residual vs Leverage Plot") + 
  scale_size_continuous("Cook's Distance", range=c(1,5)) +    
  theme(legend.position="bottom")

  • We can see no outliers that would drastically change the results of our model.
mod %>%
  ggplot(aes(.hat, .cooksd)) + 
  geom_point(na.rm=TRUE) + 
  stat_smooth(method="loess", na.rm=TRUE) + 
  xlab("Leverage hii")+ylab("Cook's Distance") + 
  ggtitle("Cook's dist vs Leverage hii/(1-hii)") + 
  geom_abline(slope=seq(0,3,0.5), color="gray", linetype="dashed")

  • Here we see an indication of a somewhat more influential data point (still within an acceptable range).

Cross Validation

predictions <- mod %>% predict(validate_data)

data.frame( R2 = caret::R2(predictions, validate_data$votos),
            RMSE = caret::RMSE(predictions, validate_data$votos),
            MAE = caret::MAE(predictions, validate_data$votos),
            ERR = caret::RMSE(predictions, validate_data$votos)/
              mean(validate_data$votos))

Now let’s talk about the results taken from the validate data (more meaningful).

  • We got a decent R² and adjusted R² of approximately 0.58. This means that this model explains approximately 58% of the response variable variability.
  • The average difference between the observed known outcome values and the values predicted by the model (RMSE) was approximately 26836.56.
    • Our model would miss the mark by approximately 26837 (RMSE); that is, if a candidate had one million votes, we would predict 26837 more (or fewer) votes than we should.
  • The average absolute difference between observed and predicted outcomes (MAE) was approximately 13791.46.
  • The prediction error rate (ERR) was 1.215642.
predictions <- mod %>% predict(test_data)

data.frame( R2 = caret::R2(predictions, test_data$votos),
            RMSE = caret::RMSE(predictions, test_data$votos),
            MAE = caret::MAE(predictions, test_data$votos),
            ERR = caret::RMSE(predictions, test_data$votos)/
              mean(test_data$votos))

Now let’s talk about the results taken from the test data (most meaningful).

  • We got a modest R² and adjusted R² of approximately 0.40. This means that this model explains approximately 40% of the response variable variability.
  • The average difference between the observed known outcome values and the values predicted by the model (RMSE) was approximately 42552.28.
    • Our model would miss the mark by approximately 42552 (RMSE); that is, if a candidate had one million votes, we would predict 42552 more (or fewer) votes than we should.
  • The average absolute difference between observed and predicted outcomes (MAE) was approximately 14855.39.
  • The prediction error rate (ERR) was 1.888835.

In the model for both elections we could see signs of the model struggling to fit the data, similar to the 2006 elections model but considerably more worrying. At the cross-validation stage our expectations were confirmed: the model for both elections fared worse than the rest, especially on the test data.